Default stopwords list should be _none_ for all but language-specific analyzers #4699

Closed
clintongormley opened this issue Jan 12, 2014 · 3 comments · Fixed by #4705

@clintongormley

/cc @s1monw

In #4092 the standard analyzer's default stopwords list was changed from english to _none_. The reasons for this are:

  1. Removing stopwords on any string field by default can have surprising results, e.g. country_code: "NO" would be indexed with no value, and title: "To be or not to be" would similarly have all of its words removed.
  2. Stopwords do add value to search, and with tools like the common query, we can take stopwords into account while still keeping queries performing well.
  3. Choosing the English stopwords by default can be surprising for users whose primary language isn't English.
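
The first point can be reproduced with a minimal sketch. The `analyze` function and the `ENGLISH_STOPWORDS` set below are hypothetical stand-ins, not the real analyzer or its exact stopword list; they only assume an analyzer that lowercases, splits on non-word characters, and then drops stopwords:

```python
import re

# Illustrative subset of an English stopword list (an assumption for this
# sketch, not the exact list shipped with any analyzer).
ENGLISH_STOPWORDS = frozenset(
    {"a", "an", "and", "be", "in", "is", "no", "not", "of", "or", "the", "to"}
)


def analyze(text, stopwords=frozenset()):
    """Lowercase, split on non-word characters, then drop stopwords."""
    tokens = [t for t in re.split(r"\W+", text.lower()) if t]
    return [t for t in tokens if t not in stopwords]


# With English stopwords, both values lose every token.
print(analyze("NO", ENGLISH_STOPWORDS))                  # []
print(analyze("To be or not to be", ENGLISH_STOPWORDS))  # []

# With no stopwords (the _none_ behaviour), every token survives.
print(analyze("To be or not to be"))
```

With an empty stopword set the title keeps all six tokens, which is the behaviour the `_none_` default is meant to give.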

However, two other non-language-specific analyzers should get the same treatment:

  • pattern analyzer
  • standard_html_strip analyzer

Also, the change to the standard analyzer has not been documented, and the standard_html_strip analyzer is not documented at all.
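
For anyone who still wants English stopwords on a language-neutral analyzer, a sketch of explicit opt-in settings might look like the following. The analyzer names `neutral_pattern` and `english_pattern` and the helper `pattern_analyzer` are made up for illustration, and the `_none_` / `_english_` values are assumed to be accepted by the `stopwords` parameter as discussed in this issue; check the docs for the version you run:

```python
import json


def pattern_analyzer(stopwords="_none_"):
    """Build a pattern analyzer definition with an explicit stopwords setting."""
    return {"type": "pattern", "pattern": r"\W+", "stopwords": stopwords}


# Hypothetical index settings: one analyzer without stopwords, one opting in.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "neutral_pattern": pattern_analyzer(),
                "english_pattern": pattern_analyzer("_english_"),
            }
        }
    }
}

# The JSON body that would be sent when creating the index.
print(json.dumps(settings, indent=2))
```

Stating the stopwords explicitly keeps the index behaviour the same regardless of which default a given release ships with.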

@clintongormley
Author

Correction: the standard analyzer docs do reflect the change to stopwords in 1.0.

@s1monw
Contributor

s1monw commented Jan 13, 2014

+1, I agree we should fix pattern and standard_html_strip as well.

@s1monw
Contributor

s1monw commented Jan 13, 2014

Actually, I think we should drop the standard_html_strip analyzer entirely and let folks configure it themselves.
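
A rough sketch of what "configure it themselves" could look like: a custom analyzer combining the `html_strip` char filter with the `standard` tokenizer and a `lowercase` token filter. The analyzer name `my_html_strip` is hypothetical, and this is an approximation under those assumptions, not the exact definition of the built-in analyzer:

```python
import json

# Hypothetical replacement for the built-in analyzer: strip HTML, tokenize,
# lowercase, and apply no stopword filter at all.
custom_html_analyzer = {
    "type": "custom",
    "char_filter": ["html_strip"],
    "tokenizer": "standard",
    "filter": ["lowercase"],
}

settings = {
    "settings": {
        "analysis": {"analyzer": {"my_html_strip": custom_html_analyzer}}
    }
}

# The JSON body that would be sent when creating the index.
print(json.dumps(settings, indent=2))
```

Anyone who does want stopwords removed can append a stop token filter to the `filter` list, which makes the choice of language explicit.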

@ghost ghost assigned s1monw Jan 13, 2014
@s1monw s1monw closed this as completed in 7f63ddf Jan 13, 2014
brusic pushed a commit to brusic/elasticsearch that referenced this issue Jan 19, 2014
…ic analyzers

The `standard_html_strip` and `pattern` analyzers support stopwords, which
are set to the `english` stopwords by default. Those analyzers should
not use stopwords by default since they are language neutral.

Closes elastic#4699